NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control

https://doi.org/10.1093/gpbjnl/qzae042

Chen, Yiling Elaine; Ge, Xinzhou; Woyshner, Kyla; McDermott, MeiLu; Manousopoulou, Antigoni; Ficarro, Scott B; Marto, Jarrod A; Li, Kexin; Wang, Leo David; Li, Jingyi Jessica (April 2024, Genomics, Proteomics & Bioinformatics)
Fu, Yan (Ed.)
Abstract Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide–spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.
more » « less
Full Text Available
Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification

Zhang, Chihao; Chen, Yiling Elaine; Zhang, Shihua; Li, Jingyi Jessica (October 2022, Journal of machine learning research)
Peng, Jie (Ed.)
Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and the search strategies.
more » « less
Full Text Available
A flexible model-free prediction-based framework for feature ranking

Li, Jingyi Jessica; Chen, Yiling Elaine; Tong, Xin (May 2021, Journal of machine learning research)

Full Text Available
The concurrence of DNA methylation and demethylation is associated with transcription regulation

https://doi.org/10.1038/s41467-021-25521-7

Shi, Jiejun; Xu, Jianfeng; Chen, Yiling Elaine; Li, Jason Sheng; Cui, Ya; Shen, Lanlan; Li, Jingyi Jessica; Li, Wei (December 2021, Nature Communications)

Abstract The mammalian DNA methylome is formed by two antagonizing processes, methylation by DNA methyltransferases (DNMT) and demethylation by ten-eleven translocation (TET) dioxygenases. Although the dynamics of either methylation or demethylation have been intensively studied in the past decade, the direct effects of their interaction on gene expression remain elusive. Here, we quantify the concurrence of DNA methylation and demethylation by the percentage of unmethylated CpGs within a partially methylated read from bisulfite sequencing. After verifying ‘methylation concurrence’ by its strong association with the co-localization of DNMT and TET enzymes, we observe that methylation concurrence is strongly correlated with gene expression. Notably, elevated methylation concurrence in tumors is associated with the repression of 40~60% of tumor suppressor genes, which cannot be explained by promoter hypermethylation alone. Furthermore, methylation concurrence can be used to stratify large undermethylated regions with negligible differences in average methylation into two subgroups with distinct chromatin accessibility and gene regulation patterns. Together, methylation concurrence represents a unique methylation metric important for transcription regulation and is distinct from conventional metrics, such as average methylation and methylation variation.
more » « less
Full Text Available
DORGE: Discovery of Oncogenes and tumoR suppressor genes using Genetic and Epigenetic features

https://doi.org/10.1126/sciadv.aba6784

Lyu, Jie; Li, Jingyi Jessica; Su, Jianzhong; Peng, Fanglue; Chen, Yiling Elaine; Ge, Xinzhou; Li, Wei (November 2020, Science Advances)
null (Ed.)
Data-driven discovery of cancer driver genes, including tumor suppressor genes (TSGs) and oncogenes (OGs), is imperative for cancer prevention, diagnosis, and treatment. Although epigenetic alterations are important for tumor initiation and progression, most known driver genes were identified based on genetic alterations alone. Here, we developed an algorithm, DORGE (Discovery of Oncogenes and tumor suppressoR genes using Genetic and Epigenetic features), to identify TSGs and OGs by integrating comprehensive genetic and epigenetic data. DORGE identified histone modifications as strong predictors for TSGs, and it found missense mutations, super enhancers, and methylation differences as strong predictors for OGs. We extensively validated DORGE-predicted cancer driver genes using independent functional genomics data. We also found that DORGE-predicted dual-functional genes (both TSGs and OGs) are enriched at hubs in protein-protein interaction and drug-gene networks. Overall, our study has deepened the understanding of epigenetic mechanisms in tumorigenesis and revealed previously undetected cancer driver genes.
more » « less
Full Text Available

Search for: All records